Ryegrass, a potential source of gluten-like proteins

Sophia Escobar-Correas

CSIRO Agriculture & Food

Introduction

Hello! My name is Sophia. I am a molecular biologist working in proteomics, currently a postdoctoral fellow. Before Data School, my only coding experience was with macros in Excel. I used to spend a lot of time cleaning and tidying protein data, and I always felt I could do it faster if I had programming skills. These weeks of learning R have changed my daily work, and seeing what is possible has given me a new perspective on my research.

My Project

I am studying ryegrass, a potential source of gluten peptide contamination. Gluten refers to a class of storage proteins found in cereal grains, including wheat, rye, barley, and oats. In people with coeliac disease, consuming these gluten proteins triggers an autoimmune response. Previous studies have identified gluten-like proteins in ryegrass. Since ryegrass is a common weed in grain fields, there is a possibility of cross-contamination. First, I need to characterise the gluten proteins and peptides of ryegrass origin. To do so, I performed a data-dependent mass spectrometry analysis, which identified 3162 proteins and 8231 peptides. What I need to do now is find out how many of them are gluten-like.

Preliminary results

I will analyse the amino acid composition of the proteins in my database. Since gluten proteins are rich in the amino acids glutamine (Q) and proline (P), I will search for all proteins that contain over 20% glutamine.

Tables
Table 1: Protein Database
N Accession Name Sequence
1 spP4910614331_MAIZE 14-3-3-like MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEEGRGNEDRVTLIKDYRGKIETELTKICDGILKLLETHLVPSSTAPESKVFYLKMKGDYYRYLAEFKTGAERKDAAENTMVAYKAAQDIALAELAPTHPIRLGLALNFSVFYYEILNSPDRACSLAKQAFDEAISELDTLSEESYKDSTLIMQLLRDNLTLWTSDISEDPAEEIREAPKRDSSEGQ
2 spQ84Q72HS181_ORYSJ 18.1 MSLIRRSNVFDPFSLDLWDPFDGFPFGSGSRSSGSIFPSFPRGTSSETAAFAGARIDWKETPEAHVFKADVPGLKKEEVKVEVEDGNVLQISGERSKEQEEKTDKWHRVERSSGKFLRRFRLPENTKPEQIKASMENGVLTVTVPKEEPKKPDVKSIQVTG
3 spP69555PSBH_WHEAT Photosystem MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN
4 spP36886PSAK_HORVU Photosystem MASQLSAMTSVPQFHGLRTYSSPRSMATLPSLRRRRSQGIRCDYIGSSTNLIMVTTTTLMLFAGRFGLAPSANRKATAGLKLEARESGLQTGDPAGFTLADTLACGAVGHIMGVGIVLGLKNTGVLDQIIG
5 spQ6YZE2GSA_ORYSJ Glutamate-1-semialdehyde MAGAAAASAAAAAVASGISARPVAPRPSPSRARAPRSVVRAAISVEKGEKAYTVEKSEEIFNAAKELMPGGVNSPVRAFKSVGGQPIVFDSVKGSRMWDVDGNEYIDYVGSWGPAIIGHADDTVNAALIETLKKGTSFGAPCVLENVLAEMVISAVPSIEMVRFVNSGTEACMGALRLVRAFTGREKILKFEGCYHGHADSFLVKAGSGVATLGLPDSPGVPKGATSETLTAPYNDVEAVKKLFEENKGQIAAVFLEPVVGNAGFIPPQPGFLNALRDLTKQDGALLVFDEVMTGFRLAYGGAQEYFGITPDVSTLGKIIGGGLPVGAYGGRKDIMEMVAPAGPMYQAGTLSGNPLAMTAGIHTLKRLMEPGTYDYLDKITGDLVRGVLDAGAKTGHEMCGGHIRGMFGFFFTAGPVHNFGDAKKSDTAKFGRFYRGMLEEGVYLAPSQFEAGFTSLAHTSQDIEKTVEAAAKVLRRI
Note: Five example proteins from the database. The Sequence column gives the amino acids (one-letter code) that make up each protein.

The next step is to count the amino acids Q and P in each sequence.

Table 2: Amino acid composition
N  Accession            Name                      totalAA  Qcomp  Q100  Pcomp  P100
1  spP4910614331_MAIZE  14-3-3-like               261      6      2.30  7      2.68
2  spQ84Q72HS181_ORYSJ  18.1                      161      4      2.48  12     7.45
3  spP69555PSBH_WHEAT   Photosystem               73       1      1.37  5      6.85
4  spP36886PSAK_HORVU   Photosystem               131      5      3.82  5      3.82
5  spQ6YZE2GSA_ORYSJ    Glutamate-1-semialdehyde  478      8      1.67  27     5.65
Note: totalAA = total number of amino acids in the protein
Qcomp = number of glutamine residues in the protein
Q100 = percentage of glutamine in the protein
Pcomp = number of proline residues in the protein
P100 = percentage of proline in the protein
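
The columns of Table 2 can be derived directly from the Sequence column. The following is a minimal sketch in base R, not the actual script used for this project: it counts Q and P in each sequence, converts the counts to percentages, and keeps the proteins with more than 20% glutamine. The helper count_residue and the second sequence (toy_gluten_like) are hypothetical and only there for illustration; the first sequence is the PSBH_WHEAT entry from Table 1.

```r
# Count how often one amino acid letter occurs in each sequence.
# Hypothetical helper, base R only.
count_residue <- function(sequences, residue) {
  matches <- gregexpr(residue, sequences, fixed = TRUE)
  lengths(regmatches(sequences, matches))
}

proteins <- data.frame(
  Accession = c("spP69555PSBH_WHEAT", "toy_gluten_like"),
  Sequence = c(
    # Sequence of row 3 in Table 1
    "MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN",
    # Made-up Q/P-rich sequence, not a real protein
    "QQPQQPFPQQ"
  ),
  stringsAsFactors = FALSE
)

# Same column names as Table 2
proteins$totalAA <- nchar(proteins$Sequence)
proteins$Qcomp <- count_residue(proteins$Sequence, "Q")
proteins$Q100 <- round(100 * proteins$Qcomp / proteins$totalAA, 2)
proteins$Pcomp <- count_residue(proteins$Sequence, "P")
proteins$P100 <- round(100 * proteins$Pcomp / proteins$totalAA, 2)

# Keep only proteins with more than 20% glutamine
gluten_like <- proteins[proteins$Q100 > 20, ]

# Show the composition columns without the long sequences
print(proteins[, c("Accession", "totalAA", "Qcomp", "Q100", "Pcomp", "P100")])
print(gluten_like$Accession)
```

The first row reproduces row 3 of Table 2 (73, 1, 1.37, 5, 6.85), and only the made-up sequence passes the 20% filter. The same steps could also be written with mutate() from dplyr and str_count() from stringr inside a pipe; base R is used here only to keep the sketch free of dependencies.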

Working with Protein Data

Plotting

Figure 1: Glutamine and Proline composition in Ryegrass

Now we can see how the gluten proteins group in wheat and in ryegrass.

My Digital Toolbox

To work with protein databases, which are usually in .fasta format, I used the Biostrings package. For tidying the data, I used the tidyverse (my new best friend), in particular dplyr and stringr. For visualisation, I used ggplot2 and gganimate.
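
A .fasta file stores each protein as a header line starting with ">" followed by one or more lines of sequence. With Biostrings, such a file is typically loaded with readAAStringSet(). The sketch below is not that route: it is a small base R parser, written only to show what the format looks like and how it turns into a table like Table 1. The function name parse_fasta_lines, the in-memory example records, and the toy second protein are assumptions for illustration.

```r
# Parse FASTA-formatted lines into a data frame with one row per protein.
# Hypothetical helper; for a real file the lines would come from readLines().
parse_fasta_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop empty lines
  is_header <- startsWith(lines, ">")
  headers <- sub("^>", "", lines[is_header])

  # Each sequence line belongs to the most recent header above it.
  record <- factor(cumsum(is_header)[!is_header], levels = seq_along(headers))
  sequences <- vapply(
    split(lines[!is_header], record),
    paste, character(1), collapse = ""
  )

  data.frame(
    Accession = sub("\\s.*$", "", headers),  # text before the first space
    Name = sub("^\\S+\\s*", "", headers),    # text after the first space
    Sequence = unname(sequences),
    stringsAsFactors = FALSE
  )
}

# Two example records: row 3 of Table 1 (wrapped over two lines)
# and a made-up Q/P-rich sequence.
fasta_lines <- c(
  ">spP69555PSBH_WHEAT Photosystem",
  "MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTT",
  "PFMGVAMALFAIFLSIILEIYNSSVLLDGILTN",
  ">toy_gluten_like hypothetical example",
  "QQPQQPFPQQ"
)

db <- parse_fasta_lines(fasta_lines)
print(db[, c("Accession", "Name")])
print(nchar(db$Sequence))
```

Joining the wrapped lines back into one string per protein matters here, because the composition percentages depend on the full sequence length.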

Favourite tool

My favourite package is the tidyverse. Learning just a few functions in the first days of Data School already made my daily work much easier. It was love at second sight. I was able to clean and tidy my data. The functions I used the most are mutate, the joins and, of course, the pipe %>%. Another of my favourite parts of working with R is using regular expressions (regex); learning them was very useful for writing scripts.
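
As one example of how regex helps with protein tables, the accessions in Table 1 end with an organism code after an underscore (for example _WHEAT). The following is a minimal sketch in base R, assuming that pattern holds for every accession; the variable names are illustrative.

```r
# Accessions copied from Table 1
accession <- c(
  "spP4910614331_MAIZE", "spQ84Q72HS181_ORYSJ", "spP69555PSBH_WHEAT",
  "spP36886PSAK_HORVU", "spQ6YZE2GSA_ORYSJ"
)

# "^.*_" matches everything up to the last underscore;
# removing it leaves only the organism code.
organism <- sub("^.*_", "", accession)

# "_WHEAT$" matches accessions that end in _WHEAT.
wheat_accessions <- accession[grepl("_WHEAT$", accession)]

# Number of proteins per organism code
organism_counts <- table(organism)

print(organism)
print(wheat_accessions)
print(organism_counts)
```

In a tidyverse script the same patterns could be passed to stringr functions inside mutate(); the regex itself stays the same.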

My time went …

tidying and cleaning raw protein data, usually in Excel. At the beginning I thought it would be hard to work with Excel sheets in R, but once you read in the .xlsx file it is easy. I started by writing a script to tidy and clean the protein data. In the beginning it was hard to figure out how to tell the program to select certain variables, but once you learn regex it is much faster. When I have doubts about functions or how to do something in R, I use Stack Overflow. I also check Twitter, because there is always good news or updates there.

Next steps

I will keep working with R; I think there is a lot I haven't tried yet. I want to get more practice writing functions. In the future I would like to create a script that identifies non-gluten proteins that are similar to gluten or that generate an immune reaction in patients. For that I will have to learn more in R, such as working with APIs.

And maybe some day start with Python…

My Data School Experience

These months with Data School have been really helpful for my career and have reinforced how much I like bioinformatics (some days too much). In the future, I will keep focusing on improving my coding skills. I hope that with this new skill I will be able to make tidying protein data much easier and faster for my team and myself. Since we started Data School I have developed several scripts that will help the team and me in future work. One of them is .proteins_fdr_report, which cleans and tidies protein reports using an FDR threshold and selects peptides with no modifications. The one I showed here is .finding_gluten. I have also been supported by my helper, who is part of my team, and we hope to create new scripts that improve our team's capabilities.